1 Introduction

In this report, we are looking at the monthly industrial production of electric and gas utilities from the year 2000 to 2021 in the United States.In United States, the industrial production increases every year.The data is all about the real production of all important enterprises in the United States, regardless of who owns them. We analyze monthly changes in the production between 2000 to 2021. In this project, we analyze the relevant time series analysis modeling to accurately predict the time series for the next 10 years.

2 Data Description

3 Data Preparation

eg<-read.csv("electric.csv")
egts = matrix(eg$IPG2211A2N)
egts = as.vector(t(egts))
egts = ts(egts,start=c(2000,1),end=c(2021,1),frequency = 12)
class(egts)
## [1] "ts"

4 Model Descriptive Analysis

For further analysis, the time series plot and scatter plot is generated to understand the five major points associated with the series and the correlation with subsequent years respectively. We also plot the time series plot with month label to understand the seasonal effect in the series.

plot(egts,type='o',ylab='Electric & Gas Production',xlab='Year',main='Time Series plot of Industrial Production-Figure 1')

plot(egts,xlab='Year',type='l',ylab='Electric & Gas Production',main='Time series Plot with Month label - Figure 2')
points(y=egts,x=time(egts), pch=as.vector(season(egts)))

From the above plots(Fig 1 and Fig 2), we can discuss the major 5 characteristics of the series:

1.Trend: Here, we can see a slight upward trend in the series which shows that the energy production is increasing over the years.

2.Seasonality: From the figure 2, we can observe seasonality in the series i.e., there are repeating patterns in the series.

3.Changing Variance: We can see that there is no change in the variance.

4.Behavior: Here, we cannot say about the moving average and autoregressive behavior of the series due to the seasonality.

5.Change point: There is no changing point or intervention point on the plot.

From Fig 2 we can observe that the production is high during January months and low during August months.
To discuss and understand about the relationship between the neighboring pairs, we can generate a scatter plot. And the correlation between the numerical values can be calculated by using the index method.

plot(y=egts,x=zlag(egts),xlab='Lag',ylab='Electric & Gas Production',main='Scatter plot of Industrial Production - Figure 3') 

y = egts
x=zlag(egts)
index = 2:length(x)
cor(y[index],x[index])
## [1] 0.5576643

From figure 3 and the correlation factor(~56%) we can observe that the correlation between the succeeding year and the production of one month is not that high. The scatter plot has a positive linear trend but the points are scattered around.

5 Model Fitting

The time series object is analyzed for finding the best model fit. We consider cosine and seasonal trends to find the bet fitting model.

5.1 Harmonic/Cosine Trend Model

Since there is a seasonal trend in the series we analyze the dataset with harmonic trend model to understand the fit.

5.1.1 Summary Statistics

For getting the summary statistics we use the harmonic() function, where we need to add the time series object and an m value which is the number of pairs of harmonic functions to be created; 2m must be less than or equal to the frequency of x.

harm.=harmonic(egts,m=1) # calculate cos(2*pi*t) and sin(2*pi*t)
eghm=lm(egts~harm.)
summary(eghm)
## 
## Call:
## lm(formula = egts ~ harm.)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19.8574  -8.1198  -0.2243   7.0806  26.2415 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      100.2236     0.6073 165.033  < 2e-16 ***
## harm.cos(2*pi*t)   2.4420     0.8572   2.849  0.00475 ** 
## harm.sin(2*pi*t)  -1.8000     0.8605  -2.092  0.03747 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.659 on 250 degrees of freedom
## Multiple R-squared:  0.04759,    Adjusted R-squared:  0.03997 
## F-statistic: 6.246 on 2 and 250 DF,  p-value: 0.002254

The p value is less than the significant value of 0.05 which signifies that the harmonic fit can be significant but the R2 value is 0.04 which is very low. This implies that around 40% of the variation in the time series is explained by the harmonic model.

5.1.2 Residual Analysis

Residual Analysis is performed to validate the model.

5.1.2.1 Time Series Plot of Standardized Residuals

eghm.res<-rstudent(eghm)
plot(eghm.res,x=as.vector(time(egts)),xlab='Time', ylab='Standardized Residuals',type='o',main="Time series Plot of Residuals - Figure 4")

From the fig 4, we can observe that most of the points are around the mean value for the residuals.

5.1.2.2 Histogram of Standardized Residuals

hist(eghm.res,xlab='Standardized Residuals', main = "Histogram of the standardized residuals - Figure 5", freq = FALSE, ylim = c(0,0.6))
curve(dnorm(x, mean=mean(eghm.res), sd=sd(eghm.res)), col="red", lwd=2, add=TRUE, yaxt="n")

There histogram shows that the residual’s normality is less with major data on the left side.

5.1.2.3 Q-Q Plot of Standardized Residuals

qqnorm(eghm.res, main="Q-Q Plot for the Linear Model- Figure 6")
qqline(eghm.res, col="red")

From QQ plot we observe that many points are not in line with normality.

5.1.2.4 Shapiro-Wilk Test

Shapiro-Wilk test calculates the correlation between the residuals and the corresponding normal quantiles.

shapiro.test(rstudent(eghm))
## 
##  Shapiro-Wilk normality test
## 
## data:  rstudent(eghm)
## W = 0.97301, p-value = 9.899e-05

The p-value is less than the alpha significance and hence we reject the null hypothesis and conclude that the time series is not normally distributed.

5.1.2.5 Auto-correlation Function Plot of Standard Residuals

acf(eghm.res,main="ACF of Standardized Residuals - Figure 7")

The ACF plot follows a slow decaying pattern and at all points the correlation value is significantly higher than the limits. This implies that the time series is non-stationary.

5.2 Seasonal Model

As the series is monthly and has a seasonality, we fit it with the seasonal model.

5.2.1 Summary Statistics

month.=season(egts) # period added to improve table display and this line sets up indicators
egsm=lm(egts~month.,-1) # -1 removes the intercept term
summary(egsm)
## 
## Call:
## lm(formula = egts ~ month., data = -1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.8547  -2.8084   0.6298   2.9711  13.7497 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      115.157      1.022 112.676  < 2e-16 ***
## month.February    -9.469      1.462  -6.475 5.28e-10 ***
## month.March      -16.686      1.462 -11.410  < 2e-16 ***
## month.April      -27.332      1.462 -18.689  < 2e-16 ***
## month.May        -25.828      1.462 -17.661  < 2e-16 ***
## month.June       -15.900      1.462 -10.872  < 2e-16 ***
## month.July        -6.477      1.462  -4.429 1.44e-05 ***
## month.August      -6.556      1.462  -4.483 1.14e-05 ***
## month.September  -17.558      1.462 -12.006  < 2e-16 ***
## month.October    -25.100      1.462 -17.163  < 2e-16 ***
## month.November   -21.870      1.462 -14.954  < 2e-16 ***
## month.December    -7.023      1.462  -4.802 2.76e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.794 on 241 degrees of freedom
## Multiple R-squared:  0.7739, Adjusted R-squared:  0.7636 
## F-statistic: 74.98 on 11 and 241 DF,  p-value: < 2.2e-16

The p value is less than the significant alpha level of 0.05 which implies the seasonal model is significant and has a R2 is 76.36% which means that 76% variation of the time series is explained by the seasonal trend.

5.2.2 Residual Analysis

5.2.2.1 Time Series Plot of Standardized Residuals

res.egsm = rstudent(egsm)
plot(ts(fitted(egsm)), ylim = c(min(c(fitted(egsm), as.vector(egts))), max(c(fitted(egsm),as.vector(egts)))),ylab='y',
     main = "Fitted seasonal model. Figure 8",type='l', col = "red")
lines(as.vector(egts))

The seasonal trend is observed in the time series plot.

5.2.2.2 Histogram of Standardized Residuals

hist(res.egsm,xlab='Standardized Residuals', main = "Histogram of standardised residuals.Figure 9")

From the figure 9, we can see that the histogram is somewhat in a symmetric form with a left skew distribution.

5.2.2.3 Q-Q Plot of Standardized Residuals

qqnorm(y=res.egsm, main = "QQ plot of standardised residuals.Figure 10")
qqline(y=res.egsm, col = 2, lwd = 1, lty = 2)

From the figure 10, we can observe that the Q-Q plot is not exactly like a straight line. So now we can say that the data does not follows a normal distribution.

5.2.2.4 Shapiro-Wilk Test

shapiro.test(res.egsm)
## 
##  Shapiro-Wilk normality test
## 
## data:  res.egsm
## W = 0.98228, p-value = 0.003057

The p-value is less than the significant alpha, 0.05 and hence we reject the null hypothesis that the stochastic component of this model is normally distributed.

5.2.2.5 Auto-correlation Function Plot of Standard Residuals

acf(res.egsm,main="ACF of Standardized Residuals - Figure 11")

From ACF plot we can observe that it follows a slow decaying pattern and all lags are above the limit.

6 Log Transformation

Using the log transformation method we transform the non-stationary series into stationary.

eglog<-log(egts)
plot(eglog, type='o',main=' Time Series plot of log transformed data Figure 12')

There is very slight difference in the time series plot while comparing the original one but not enough details. We generate ACF and PACF plots for the log-transformed data

acf(eglog, main = ' ACF plot for log-transformed data Figure 13')

pacf(eglog, main = ' PACF plot for log-transformed data Figure 14')

ACF plot has slow decaying pattern at seasonal lag and PACF has slow decay pattern.

7 ADF Test

The Augmented Dickey-Fuller Test is used to test the null hypothesis that the process is non-stationary but becomes stationary after first differencing and the alternative hypothesis is that the process is stationary.

adf.test(eglog)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  eglog
## Dickey-Fuller = -5.3233, Lag order = 6, p-value = 0.01
## alternative hypothesis: stationary

From the above result, we can see that the p-value is 0.01 which is less than the significant alpha value 0.05 and therefore, we reject the null hypothesis.

8 Seasonal ARIMA Model(SARIMA)

To deal with the seasonality in the time series, we need to go through the Seasonal ARIMA models.

A seasonal ARIMA model is formed by including additional seasonal terms in the ARIMA models.

SARIMA models have two sets of orders (p,d,q) for the regular part of the model and (P,D,Q) for the seasonal part of the model and a frequency. So we show the model as *SARIMA(p,d,q)×(P,D,Q)_s* where - p is the autoregressive (ar) order - d is the number of ordinary differences - q is the moving average (ma) order - P is the Seasonal autoregressive (sar) order - D is the number of seasonal differences - Q is the Seasonal moving average (sma) order - s is the period.

8.1 First Differencing

Here we need to do the seasonal differencing for the further analysis. The seasonal difference of period for the series {Y_t} is defined as ∇sY_t = Y_t-Y(t−s)

Here we are doing the residuals after applying D=1.

8.2 Residuals Approach

8.2.1 Specification of seasonal part (D=1)

To do further analysis, we will generate time series plot, ACF plot and PACF plot by using following codes.

Now we will fit SARIMA(0,0,0)x(0,1,0)_12 model and display time series, ACF and PACF plots of the residuals.

m.eg= arima(eglog,order=c(0,0,0),seasonal=list(order=c(0,1,0), period=12),method = "ML")
res.m = residuals(m.eg);  
plot(res.m,xlab='Time',ylab='Residuals',main=" Time series plot of the residuals- Figure 15")

acf(res.m, lag.max = 48, main = " The sample ACF of the residuals- Figure 16")

pacf(res.m, lag.max = 48, main = " The sample PACF of the residuals- Figure 17")

From the above time series plot, we can observe that there is no specific trend in the plot and we cannot observe any changing variance in the time series.

From the ACF plot, we can observe 3 significant seasonal lags at 1, 2 and 4. And similarly in the PACF plot, we can observe 3 significant seasonal lags at 1, 2 and 3.

And the seasonal trend is removed from the time series after applying the first differencing itself, i.e., D=1.

8.2.2 Specification of seasonal part (D=1 P=3 Q=3)

Now we will fit SARIMA(0,0,0)x(3,1,3)_12 model and display time series, ACF and PACF plots of the residuals.

m1.eg= arima(eglog,order=c(0,0,0),seasonal=list(order=c(3,1,3), period=12),method = "ML")
res.m1 = residuals(m1.eg);  
plot(res.m1,xlab='Time',ylab='Residuals',main=" Time series plot of the residuals-Figure 18")

acf(res.m1, lag.max = 48, main = "The sample ACF of the residuals- Figure 19")

pacf(res.m1, lag.max = 48, main = "The sample PACF of the residuals- Figure 20")

From the time series plot, we can observe that there is no specific trend in the plot.

From the ACF plot, we cannot find any correlation between the values and there are no significant lag in the seasonal lags.

From the PACF plot, we can find slight correlation between the values and also there are significant seasonal lags at 1 and 3. But there is no significance at lag 2, so it can be disregarded.

8.2.3 Specification of Ordinary part (D=1 P=3 Q=3 d=1)

Here, we need to deal with the ordinary lag, so we did the first differencing and we generate the following outputs.

Now we will fit SARIMA(0,1,0)x(3,1,3)_12 model and display time series, ACF and PACF plots of the residuals.

m2.eg= arima(eglog,order=c(0,1,0),seasonal=list(order=c(3,1,3), period=12))
res.m2 = residuals(m2.eg);  
plot(res.m2,xlab='Time',ylab='Residuals',main="Figure 22: Time series plot of the residuals")

acf(res.m2, lag.max = 48, main = " The sample ACF of the residuals-Figure 23")

pacf(res.m2, lag.max = 48, main = " The sample PACF of the residuals- Figure 24")

There is no specific trend shown in the time series plot.

And we can observe that all seasonal lags or correlation are removed from the ACF plot and PACF plot.

By analyzing the ACF plot, we can take q value as 2 and from the PACF plot, we can take p value as 4 at the ordinal lags.

8.2.4 Specification of Ordinary part (D=1 P=3 Q=3 d=1 p=4 q=2)

Now we will fit SARIMA(4,1,2)x(3,1,3)_12 model and display time series, ACF and PACF plots of the residuals.

m3.eg= arima(eglog,order=c(4,1,2),seasonal=list(order=c(3,1,3), period=12))
res.m3 = residuals(m3.eg);  
plot(res.m3,xlab='Time',ylab='Residuals',main=" Time series plot of the residuals - Figure 23")

acf(res.m3, lag.max = 48, main = " The sample ACF of the residuals-Figure 24")

pacf(res.m3, lag.max = 48, main = " The sample PACF of the residuals- Figure 25")

From the above plots, we can see that the time series plot does not show any trend. And there is no significant lags showing near zero in ACF and PACF plot.

So by analyzing the above outputs, we obtain our final SARIMA model as SARIMA(4,1,2)x(3,1,3)_12 model.

8.3 EACF

The Extended Autocorrelation Function (EACF) method uses the fact that if the AR part of a mixed ARMA model is known, “filtering out” the autoregression from the observed time series results in a pure MA process that enjoys the cutoff property in its ACF.

Now we can use the EACF method on the residuals.

eacf(res.m2)
## AR/MA
##   0 1 2 3 4 5 6 7 8 9 10 11 12 13
## 0 x x o o o o o o o o o  o  o  o 
## 1 x x o o o o o o o o o  o  o  o 
## 2 x o o o o o o o o o o  o  o  o 
## 3 x o x x o o o o o o o  o  o  o 
## 4 x x x x o o o o o o o  o  o  o 
## 5 x x o x o o o o o o o  o  o  o 
## 6 x x o x o o o o o o o  o  o  o 
## 7 x x o x x o x o o o o  o  o  o

From EACF we consider the top left ‘0’ which is not interrupted by ‘x’ to find out model. The p and q values considered are: * p = 0,1 * q = 2,3

The models are (0,1,2), (0,1,3), (1,1,2) and (1,1,3).

8.4 BIC

Another widely used criterion in model selection is Bayesian Information Criterion (BIC). In BIC table we consider the nar and nma values to be 4 as at higher values the significant value changes drastically. The ‘test-lag’ represents the p value and ‘error-lag’ represents q value.

bic = armasubsets(y=res.m2,nar=5,nma=5,y.name='test',ar.method='ols')
plot(bic)

From BIC Table we can find the p and q values as: p = 0 q = 1,2,5

Models from BIC are (0,1,1), (0,1,2), (0,1,5).

8.5 ADF Test

adf.test(res.m2)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  res.m2
## Dickey-Fuller = -8.3136, Lag order = 6, p-value = 0.01
## alternative hypothesis: stationary

From the above result, we can see that the p-value is 0.01 which is less than the significant alpha value 0.05 and therefore, we reject the null hypothesis.

8.6 Model Fitting

Final set of possible models are given below:

SARIMA(1,1,2)(3,1,3)

SARIMA(1,1,3)(3,1,3)

SARIMA(0,1,3)(3,1,3)

SARIMA(0,1,1)(3,1,3)

SARIMA(0,1,2)(3,1,3)

SARIMA(0,1,5)(3,1,3)

SARIMA(4,1,2)(3,1,3)

9 Parameter Estimations

Model fitting for parameter estimation is straight forward by the use of R. Here we are using the ML method and the coefficient test on the selected models one by one.

9.1 SARIMA(1,1,3)(3,1,3)

We do not consider this model as we get the ‘infinite error’ while doing parameter estimation.

9.2 SARIMA(1,1,2)(3,1,3)_12

m5_112 = arima(eglog,order=c(1,1,2),seasonal=list(order=c(3,1,3), period=12),method = "ML")
coeftest(m5_112)
## 
## z test of coefficients:
## 
##        Estimate Std. Error  z value  Pr(>|z|)    
## ar1  -0.1553239  0.0037071 -41.8992 < 2.2e-16 ***
## ma1  -0.3329596  0.0097444 -34.1695 < 2.2e-16 ***
## ma2  -0.3702565  0.0043663 -84.7980 < 2.2e-16 ***
## sar1 -0.4229632  0.0136229 -31.0479 < 2.2e-16 ***
## sar2 -0.3610211  0.0051360 -70.2927 < 2.2e-16 ***
## sar3 -0.2903663         NA       NA        NA    
## sma1 -0.3134615  0.0145895 -21.4855 < 2.2e-16 ***
## sma2 -0.2134780  0.0224584  -9.5055 < 2.2e-16 ***
## sma3  0.0108693         NA       NA        NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
residual.analysis(model = m5_112)
## 
##  Shapiro-Wilk normality test
## 
## data:  res.model
## W = 0.9941, p-value = 0.4272

From the above z-test result, we can see that all ar, ma, seasonal ar and seasonal ma components are significant but we have two values are missing in the result. From the shapiro-wilk test, we can say that the p value is 0.4272 and so it indicates the normality of the data.

From the residual plots we can observe that:

  • There slight trend in the time series plot with little variance from 2000 to 2005.
  • The histogram follows a normal distribution with a slight left skew.
  • Almost all points are on the normality line for the QQ Plot.
  • The ACF plot follows a white noise series with one lag which is far from the zero.
  • Few points are on the line for the Ljung Box test.

9.3 SARIMA(0,1,3)(3,1,3)_12

m5_013 = arima(eglog,order=c(0,1,3),seasonal=list(order=c(3,1,3), period=12),method = "ML")
coeftest(m5_013)
## 
## z test of coefficients:
## 
##       Estimate Std. Error z value  Pr(>|z|)    
## ma1  -0.505135   0.065539 -7.7074 1.284e-14 ***
## ma2  -0.248325   0.074964 -3.3126 0.0009244 ***
## ma3  -0.094721   0.061580 -1.5382 0.1240067    
## sar1 -0.478492   0.344758 -1.3879 0.1651654    
## sar2 -0.116342   0.240470 -0.4838 0.6285190    
## sar3 -0.346442   0.130500 -2.6547 0.0079375 ** 
## sma1 -0.290651   0.353681 -0.8218 0.4111969    
## sma2 -0.532717   0.395619 -1.3465 0.1781284    
## sma3  0.254726   0.244266  1.0428 0.2970314    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
residual.analysis(model = m5_013)
## 
##  Shapiro-Wilk normality test
## 
## data:  res.model
## W = 0.99629, p-value = 0.8155

From the above z-test result, we can see that two ma and one seasonal ar components are significant while all seasonal ma components and others are insignificant. From the shapiro-wilk test, we can say that the p value is 0.815, which is the largest p-value among all and so it indicates the normality of the data.

From the residual plots we can observe that:

  • There is no specific trend in the time series plot.
  • The histogram follows a normal distribution.
  • Expect for one point all are in line for the QQ Plot
  • The ACF depicts a white noise series.
  • All points are above the threshold limit in Ljung box test.

9.4 SARIMA(0,1,1)(3,1,3)_12

m5_011 = arima(eglog,order=c(0,1,1),seasonal=list(order=c(3,1,3), period=12),method = "ML")
coeftest(m5_011)
## 
## z test of coefficients:
## 
##       Estimate Std. Error z value  Pr(>|z|)    
## ma1  -0.569199   0.080085 -7.1075 1.182e-12 ***
## sar1 -0.466078   0.264775 -1.7603 0.0783606 .  
## sar2 -0.149082   0.189504 -0.7867 0.4314599    
## sar3 -0.365739   0.096361 -3.7955 0.0001474 ***
## sma1 -0.276432   0.278636 -0.9921 0.3211528    
## sma2 -0.481100   0.265105 -1.8148 0.0695617 .  
## sma3  0.184476   0.203832  0.9050 0.3654444    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
residual.analysis(model = m5_011)
## 
##  Shapiro-Wilk normality test
## 
## data:  res.model
## W = 0.99236, p-value = 0.2169

From the above z-test result, we can see that the ma and one seasonal ar components are significant while all seasonal ma components and others are insignificant. From the shapiro-wilk test, we can say that the p value is 0.21 and so it indicates the normality of the data.

From the residual plots we can observe that:

  • There is no specific trend in the time series plot.
  • The histogram follows a skewed distribution.
  • Most of the points are in line for the QQ Plot.
  • The ACF plot have 2 slight significant lags.
  • Few points are below the threshold limit and most of the points are very close to the threshold limit in the Ljung Box Test.

9.5 SARIMA(0,1,2)(3,1,3)_12

m5_012 = arima(eglog,order=c(0,1,2),seasonal=list(order=c(3,1,3), period=12),method = "ML")
coeftest(m5_012)
## 
## z test of coefficients:
## 
##       Estimate Std. Error z value  Pr(>|z|)    
## ma1  -0.517120   0.063732 -8.1140 4.898e-16 ***
## ma2  -0.287665   0.069845 -4.1186 3.812e-05 ***
## sar1 -0.493069   0.289521 -1.7031  0.088559 .  
## sar2 -0.137822   0.213950 -0.6442  0.519460    
## sar3 -0.362463   0.110713 -3.2739  0.001061 ** 
## sma1 -0.266522   0.298319 -0.8934  0.371637    
## sma2 -0.506881   0.318809 -1.5899  0.111853    
## sma3  0.241554   0.219611  1.0999  0.271369    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
residual.analysis(model = m5_012)
## 
##  Shapiro-Wilk normality test
## 
## data:  res.model
## W = 0.99547, p-value = 0.6667

From the above z-test result, we can see that the all ma and one seasonal ar components are significant while all seasonal ma components and others are insignificant. From the shapiro-wilk test, we can say that the p value is 0.66 and so it indicates the normality of the data.

From the residual plots we can observe that:

  • There is no specific trend in the time series plot.
  • The histogram follows a normal distribution.
  • Expect for one point all are in line for the QQ Plot.
  • The ACF depicts a white noise series.
  • Few points are on the threshold limit in Ljung Box test.

9.6 SARIMA(0,1,5)(3,1,3)_12

m5_015 = arima(eglog,order=c(0,1,5),seasonal=list(order=c(3,1,3), period=12),method = "ML")
coeftest(m5_015)
## 
## z test of coefficients:
## 
##       Estimate Std. Error z value  Pr(>|z|)    
## ma1  -0.500002   0.066031 -7.5723 3.667e-14 ***
## ma2  -0.229586   0.075094 -3.0573 0.0022333 ** 
## ma3  -0.019637   0.076410 -0.2570 0.7971870    
## ma4  -0.025942   0.075666 -0.3428 0.7317161    
## ma5  -0.104728   0.061618 -1.6997 0.0891964 .  
## sar1 -0.480851   0.278543 -1.7263 0.0842921 .  
## sar2 -0.131183   0.224429 -0.5845 0.5588711    
## sar3 -0.353962   0.105706 -3.3486 0.0008123 ***
## sma1 -0.255057   0.290093 -0.8792 0.3792785    
## sma2 -0.500414   0.290369 -1.7234 0.0848209 .  
## sma3  0.213334   0.228987  0.9316 0.3515211    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
residual.analysis(model = m5_015)
## 
##  Shapiro-Wilk normality test
## 
## data:  res.model
## W = 0.99606, p-value = 0.7759

From the above z-test result, we can see that the two ma and one seasonal ar components are significant while all highest ma components are insignificant. Also all seasonal ma components and others are insignificant. From the shapiro-wilk test, we can say that the p value is 0.77 and so it indicates the normality of the data.

From the residual plots we can observe that:

  • There is no specific trend in the time series plot.
  • The histogram follows almost a normal distribution.
  • Expect for one point all are in line for the QQ Plot.
  • The ACF depicts a white noise series.
  • All points are above the threshold limit in Ljung box test.

9.7 SARIMA(4,1,2)(3,1,3)_12

m5_412 = arima(eglog,order=c(4,1,2),seasonal=list(order=c(3,1,3), period=12),method = "ML")
coeftest(m5_412)
## 
## z test of coefficients:
## 
##        Estimate Std. Error  z value  Pr(>|z|)    
## ar1  -0.4930919  0.0816641  -6.0380 1.560e-09 ***
## ar2   0.4042534  0.0855713   4.7242 2.311e-06 ***
## ar3   0.0539132  0.0850798   0.6337 0.5262911    
## ar4   0.1164633  0.0733462   1.5879 0.1123185    
## ma1  -0.0017942  0.0485947  -0.0369 0.9705466    
## ma2  -0.8824977  0.0445695 -19.8005 < 2.2e-16 ***
## sar1 -0.4343572  0.2589296  -1.6775 0.0934426 .  
## sar2 -0.1236933  0.2131106  -0.5804 0.5616326    
## sar3 -0.3548815  0.0993551  -3.5718 0.0003545 ***
## sma1 -0.3466967  0.2704010  -1.2822 0.1997874    
## sma2 -0.4846772  0.2762982  -1.7542 0.0793995 .  
## sma3  0.2300674  0.2203161   1.0443 0.2963650    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
residual.analysis(model = m5_412)
## 
##  Shapiro-Wilk normality test
## 
## data:  res.model
## W = 0.994, p-value = 0.4118

From the above z-test result, we can see that the two ar, one ma and one seasonal ar components are significant while all seasonal ma components and others are insignificant. From the shapiro-wilk test, we can say that the p value is 0.411 and so it indicates the normality of the data.

From the residual plots we can observe that:

  • There is no specific trend in the time series plot.
  • The histogram follows almost normal distribution.
  • Expect for one point all are in line for the QQ Plot.
  • The ACF depicts a white noise series.
  • All points are above the threshold limit in Ljung box test.

9.8 AIC BIC Score

  • AIC is an estimate of the prediction error and therefore the relative quality of the statistical model for a given data set.
  • BIC is an estimate of the posterior probability function that the model is true under a certain Bayesian configuration.

To find the best model we use AIC/BIC score.

sc.AIC=AIC(m5_112,m5_013,m5_011,m5_012,m5_015,m5_412)
sc.BIC=AIC(m5_112,m5_013,m5_011,m5_012,m5_015,m5_412, k = log(length(egts)))

AIC Score

sort.score(sc.AIC, score = "aic")
##        df       AIC
## m5_412 13 -1025.668
## m5_015 12 -1024.510
## m5_013 10 -1024.198
## m5_012  9 -1023.936
## m5_112 10 -1018.411
## m5_011  8 -1011.371

BIC Score

sort.score(sc.BIC, score = "aic")
##        df       AIC
## m5_012  9 -992.1351
## m5_013 10 -988.8636
## m5_011  8 -983.1035
## m5_112 10 -983.0771
## m5_015 12 -982.1092
## m5_412 13 -979.7338

From the AIC and BIC score we observe that model m5_412, m5_015, m5_012 and m5_013 are the top models. Considering the residual results and the AIC/BIC score we obtain SARIMA(0,1,3)(3,1,3) as the best model fit for the time series.

9.9 Best Model

We consider SARIMA(0,1,3)(3,1,3) as the best model. While doing the residual analysis we can observe that:

  • From the parameter estimation we can observe that:
  • There is no trend in the time series plot of standardized residuals and no variance is also observed.
  • The histogram follows a normal distribution.
  • Most of the points are on line in the QQ Plot with very few points out of line at both tails.
  • There are no significant lag in the ACF plot implying a white noise series.
  • All points are above the threshold line in the Ljung Box Test.

Also it has the highest value in the Shapiro Wilk test.

9.10 BEST MODEL - over parametirization

Over parametirization is done to understand if there is better fit for the selected model.

m5_014 = arima(log(egts),order=c(0,1,4),seasonal=list(order=c(3,1,3), period=12),method = "ML")
coeftest(m5_014)
## 
## z test of coefficients:
## 
##       Estimate Std. Error z value  Pr(>|z|)    
## ma1  -0.513882   0.066380 -7.7415 9.824e-15 ***
## ma2  -0.221137   0.074336 -2.9748  0.002932 ** 
## ma3  -0.047655   0.072956 -0.6532  0.513625    
## ma4  -0.084136   0.070441 -1.1944  0.232314    
## sar1 -0.506153   0.314973 -1.6070  0.108060    
## sar2 -0.146686   0.230645 -0.6360  0.524787    
## sar3 -0.333712   0.110595 -3.0174  0.002549 ** 
## sma1 -0.240473   0.326117 -0.7374  0.460889    
## sma2 -0.504942   0.320614 -1.5749  0.115274    
## sma3  0.185115   0.240357  0.7702  0.441201    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
residual.analysis(model = m5_014)
## 
##  Shapiro-Wilk normality test
## 
## data:  res.model
## W = 0.99616, p-value = 0.7925

There are few significant probability values for the AR/MA/SAR models. And the Shapiro Wilk test gives a p value of 0.7925 indicating normality of the dataset.

From the residual plots we can observe that:

  • There is no specific trend in the time series plot.
  • The histogram follows almost normal distribution.
  • Expect for one point all are in line for the QQ Plot.
  • The ACF depicts a white noise series.
  • All points are above the threshold limit in Ljung box test.

9.11 Forecasting - Prediction

We do the prediction of the time series with the selected model for the next 10 years and plot it.

egforecast = Arima(eglog,order=c(0,1,3),seasonal=list(order=c(3,1,3), period=12),method = "ML")
future = forecast(egforecast, h = 120)
plot(future,main = "Forecast for 10 years - Figure 26")

The prediction shows similar trend for the next 10 years. We can also observe that the interval increases as the year increases implying that there could be a upper or lower trend in the time series in future.

10 Summary and Conclusion

The Electric and Gas production dataset was analyzed to find the best model. We read the dataset and then converted it into time series object to understand the data description with the 5 major points and scatter plot. The model fitting was done with harmonic and seasonal model and we found that the models are not goof fit for the dataset. We then use Seasonal ARIMA model to understand the dataset. The time series dataset is log transformed and then seasonal differencing is applied to get a model. Using the seasonal differencing we find the seasonal lags and then handle ordinary lags. With this results we do EACF and BIC to find the best fit models for the time series.

SARIMA Models were analyzed using the residual method and we find the best model fit for predicting the next 10 years outcome.

11 Reference

  1. Board of Governors of the Federal Reserve System (US), Industrial Production: Utilities: Electric and Gas Utilities (NAICS = 2211,2) [IPG2211A2N], retrieved from FRED, Federal Reserve Bank of St. Louis; https://fred.stlouisfed.org/series/IPG2211A2N, May 25, 2021.
  2. Home - RDocumentation. (2021). Retrieved 11 June 2021, from https://www.rdocumentation.org/
  3. Cryer, Jonathan D., and Kung-Sik. Chan. Time Series Analysis With Applications in R . 2nd ed. 2008. New York, NY: Springer New York, 2008. Web.
  4. Demirhan, H. (2021). Math1318 Time Series Analysis[Lecture Notes].Canvas@RMIT University. https://rmit.instructure.com/courses/79716/modules